Communicative efficiency and syntactic predictability: A cross-linguistic study based on the Universal Dependencies corpora

Author

Natalia Levshina

Abstract

There is ample evidence that human communication is organized efficiently: more predictable information is usually encoded by shorter linguistic forms, and less predictable information by longer forms. The present study, which is based on the Universal Dependencies corpora, investigates whether the length of words can be predicted from the average syntactic information content, defined as the average information content of a word given its counterpart in a dyadic syntactic relationship. The effect of this variable is tested on data from nine typologically diverse languages while controlling for other well-known parameters: word frequency and average word predictability based on the preceding and following words. Poisson generalized linear models and conditional random forests show that words with higher average syntactic informativity are usually longer in most languages, although this effect is often found in interactions with average information content based on the neighbouring words. The results of this study demonstrate that syntactic predictability should be considered as a separate factor in future work on communicative efficiency.

1 Research hypothesis

It is well known that more predictable information tends to be presented by shorter forms and less coding material, whereas less predictable information is expressed by longer forms and more coding material. This form-function mapping allows for efficient communication. A famous example is the inverse correlation between the frequency of a linguistic unit and its length discovered by Zipf (1935[1968]). The main cause is an underlying law of economy, saving time and effort (ibid.: 38). In the domain of grammar, Greenberg (1966) provided substantial cross-linguistic evidence that unmarked members of grammatical categories (e.g. singular number or present tense) are more frequent than their marked counterparts (e.g. dual/plural or future/past, respectively). This idea has been developed further by Haspelmath (2008), who provides numerous examples of coding asymmetries in which the more frequent morphosyntactic forms are shorter than the functionally comparable less frequent ones. These asymmetries can be explained by the tendency of language users to make communication efficient: “The overall number of formal units that speakers need to produce in communication is reduced when the more frequent and expected property values are assigned zero” (Hawkins, 2014: 16).

While the accounts mentioned above are based on the context-free probability of linguistic units, other approaches, which go back to Shannon’s (1948) information theory, take into consideration the conditional probability of a unit given its context. The measures computed from these conditional probabilities are often called information content, surprisal, or informativity. There is ample evidence of ‘online’ word reduction in speech production based on contextual predictability (e.g. Aylett and Turk, 2004; Bell et al., 2009). In addition, ‘offline’ effects of average informativity on formal length have been found in written corpora: the more predictable a word is on average, the shorter it is (Piantadosi et al., 2011). One of the explanations of such correlations is known as the hypothesis of Uniform Information Density (Levy and Jaeger, 2007), which says that information tends to be distributed uniformly across the speech
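As a rough illustration of the measure named in the abstract, the sketch below computes, for each word, the mean of -log2 P(word | counterpart) over all of its occurrences in a list of (word, counterpart) pairs. It is a minimal sketch under stated assumptions, not the study's actual procedure: the function name `average_syntactic_information`, the pair representation, and the toy data are hypothetical, and the excerpt does not say how pairs are extracted from the Universal Dependencies corpora or whether the relation label is part of the context.

```python
import math
from collections import Counter


def average_syntactic_information(pairs):
    """Map each word to the mean of -log2 P(word | counterpart), in bits.

    `pairs` is a list of (word, counterpart) tuples, one per occurrence of
    a dyadic syntactic relationship (a hypothetical input format).
    """
    pair_counts = Counter(pairs)
    counterpart_counts = Counter(counterpart for _, counterpart in pairs)

    total_bits = Counter()   # summed information content per word
    occurrences = Counter()  # number of occurrences per word
    for (word, counterpart), n in pair_counts.items():
        # Maximum-likelihood estimate of P(word | counterpart).
        p = n / counterpart_counts[counterpart]
        total_bits[word] += n * -math.log2(p)
        occurrences[word] += n

    return {word: total_bits[word] / occurrences[word] for word in total_bits}


# Toy data invented for illustration only (determiner, noun) pairs.
toy_pairs = [
    ("the", "dog"), ("the", "dog"), ("the", "dog"), ("this", "dog"),
    ("the", "cat"), ("this", "cat"),
]
scores = average_syntactic_information(toy_pairs)
# "this" is less predictable from its noun than "the", so it scores higher:
# (2 bits after "dog" + 1 bit after "cat") / 2 occurrences = 1.5 bits.
```

Under the efficiency hypothesis, words with higher scores would be expected to be longer on average; in this toy data the longer form "this" does score higher than "the", but the example is constructed and carries no evidential weight.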


Similar resources

An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies

A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...


Syntactic Structures and Rhetorical Functions of Electrical Engineering, Psychiatry, and Linguistics Research Article Titles in English and Persian: A Cross-linguistic and Cross-disciplinary Study

A research article (RA) title is the first and foremost feature that attracts the reader's attention, the feature from which she/he may decide whether the whole article is worth reading. The present study attempted to investigate syntactic structures and rhetorical functions of RA titles written in English and Persian and published in journals in three disciplines of Electrical Engineering, Psy...


The Predictive Power of Syntactic Knowledge, Vocabulary Breadth and Metacognitive Strategies for L2 Reading Fluency

Fluent reading is a multifaceted ability that integrates several linguistic and non-linguistic processes. Accordingly, recognizing the critical components of fluent reading is highly significant in planning and implementing effective reading programs. This study was undertaken to evaluate the predictive power of syntactic knowledge, vocabulary breadth, and metacognitive awareness of reading str...


Syntactic Complexity of Russian Unified State Exam Texts in English: A Study on Reliability and Validity

In this study we analyze texts used in the Russian Unified State Exam (USE) in English. The texts, which formed two small research corpora, were retrieved from two sources: the official USE database as a reference point, and “Neznaika” (https://neznaika.pro/), a popular website used by pupils for USE training. The two corpora are balanced in size: USE has 11934 tokens and “Neznaika” has 11918 tokens. We share Biber’...


Cross-linguistic Influence at Syntax-pragmatics Interface: A Case of OPC in Persian

Recent research in the area of Second Language Acquisition has proposed that bilinguals and L2 learners show syntactic indeterminacy when syntactic properties interface with other cognitive domains. Most of the research in this area has focused on the pragmatic use of syntactic properties while the investigation of compliance with a grammatical rule at syntax-related interfaces has not received...



Publication date: 2017
